1 Effective cache prefetching on bus-based multiprocessors
Dean M. Tullsen, Susan J. Eggers 

February 1995 ACM Transactions on Computer Systems (TOCS), volume 13 issue 1

Full text available: pdf(2.30 MB) Additional Information: full citation, abstract, references, citings, index terms

Compiler-directed cache prefetching has the potential to hide much of the high memory 
latency seen by current and future high-performance processors. However, prefetching is 
not without costs, particularly on a shared-memory multiprocessor. Prefetching can 
negatively affect bus utilization, overall cache miss rates, memory latencies and data 
sharing. We simulate the effects of a compiler-directed prefetching algorithm, running on a 
range of bus-based multiprocessors. We show that, despite a ... 

Keywords: bus-based multiprocessors, cache prefetching, false sharing, memory latency 
hiding 



2 Architectural and compiler support for effective instruction prefetching: a cooperative
approach

February 2001 ACM Transactions on Computer Systems (TOCS), volume 19 issue 1

Full text available: pdf(432.96 KB) Additional Information: full citation, abstract, references, citings, index terms, review

Instruction cache miss latency is becoming an increasingly important performance 
bottleneck, especially for commercial applications. Although instruction prefetching is an 
attractive technique for tolerating this latency, we find that existing prefetching schemes are 
insufficient for modern superscalar processors, since they fail to issue prefetches early 
enough (particularly for nonsequential accesses). To overcome these limitations, we propose 
a new instruction prefetching technique where ... 

Keywords: compiler optimization, instruction prefetching 



3 Effective jump-pointer prefetching for linked data structures
Amir Roth, Gurindar S. Sohi

May 1999 ACM SIGARCH Computer Architecture News, Proceedings of the 26th
annual international symposium on Computer architecture, volume 27 issue 2

Additional Information: full citation, abstract, references, citings, index terms

Current techniques for prefetching linked data structures (LDS) exploit the work available in 
one loop iteration or recursive call to overlap pointer chasing latency. Jump pointers, which 
provide direct access to non-adjacent nodes, can be used for prefetching when loop and 
recursive procedure bodies are small and do not have sufficient work to overlap a long 
latency. This paper describes a framework for jump-pointer prefetching (JPP) that supports 
four prefetching idioms: queue, full, chain, an ... 
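
As a rough illustration of the jump-pointer idea in the abstract above, the following Python sketch installs a pointer from each list node to the node a fixed number of hops ahead, using a small queue of recently visited nodes. It is not the paper's framework: the `Node.jump` field, the `distance` parameter, and all function names are hypothetical, and because Python has no prefetch instruction, the traversal only records which node a real prefetch would target.

```python
from collections import deque


class Node:
    """Singly linked list node with a hypothetical jump-pointer field."""

    def __init__(self, value):
        self.value = value
        self.next = None
        self.jump = None  # assumed field: node `distance` hops ahead, or None


def build_list(values):
    """Build a singly linked list from an iterable and return its head."""
    head = None
    tail = None
    for v in values:
        node = Node(v)
        if head is None:
            head = node
        else:
            tail.next = node
        tail = node
    return head


def install_jump_pointers(head, distance):
    """Point each node at the node `distance` hops ahead of it.

    A queue holds the last `distance` nodes visited; once it is full,
    the oldest node in it is exactly `distance` hops behind the current one.
    """
    recent = deque()
    node = head
    while node is not None:
        if len(recent) == distance:
            recent.popleft().jump = node
        recent.append(node)
        node = node.next


def traverse(head):
    """Walk the list, recording the jump target a prefetch would touch."""
    visited, prefetched = [], []
    node = head
    while node is not None:
        if node.jump is not None:
            # Stand-in for issuing a prefetch of node.jump.
            prefetched.append(node.jump.value)
        visited.append(node.value)
        node = node.next
    return visited, prefetched
```

With six nodes and `distance=2`, the first four nodes each name a target two hops ahead, while the last two have no jump pointer.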

4 Simple and effective array prefetching in Java
Brendon Cahoon, Kathryn S. McKinley 

November 2002 Proceedings of the 2002 joint ACM-ISCOPE conference on Java Grande 

Full text available: pdf(118.49 KB) Additional Information: full citation, abstract, references, citings, index terms

Java is becoming a viable choice for numerical algorithms due to the software engineering 
benefits of object-oriented programming. Because these programs still use large arrays that 
do not fit in the cache, they continue to suffer from poor memory performance. To hide 
memory latency, we describe a new unified compile-time analysis for software prefetching 
arrays and linked structures in Java. Our previous work uses data-flow analysis to discover 
linked data structure accesses, and here we presen ... 

Keywords: Java, array prefetching, memory optimization, static analysis 



5 Cooperative prefetching: compiler and hardware support for effective instruction
prefetching in modern processors 
Chi-Keung Luk, Todd C. Mowry 

November 1998 Proceedings of the 31st annual ACM/IEEE international symposium on 
Microarchitecture 

Full text available: pdf(3.41 MB) Additional Information: full citation, references, citings, index terms



6 An effective programmable prefetch engine for on-chip caches
Tien-Fu Chen 

December 1995 Proceedings of the 28th annual international symposium on 
Microarchitecture 

Full text available: pdf(721.83 KB) Additional Information: full citation, references, citings, index terms



7 Data prefetch mechanisms
Steven P. Vanderwiel, David J. Lilja 

June 2000 ACM Computing Surveys (CSUR), volume 32 issue 2 

Full text available: pdf(172.07 KB) Additional Information: full citation, abstract, references, citings, index terms, review

The expanding gap between microprocessor and DRAM performance has necessitated the 
use of increasingly aggressive techniques designed to reduce or hide the latency of main 
memory access. Although large cache hierarchies have proven to be effective in reducing 
this latency for the most frequently used data, it is still not uncommon for many programs
to spend more than half their run times stalled on memory requests. Data prefetching has 
been proposed as a technique for hiding the access lat ... 

Keywords: memory latency, prefetching 

8 Speeding up irregular applications in shared-memory multiprocessors: memory binding
and group prefetching

Zheng Zhang, Josep Torrellas

May 1995 ACM SIGARCH Computer Architecture News, Proceedings of the 22nd
annual international symposium on Computer architecture, volume 23 issue 2

Full text available: pdf(1.74 MB) Additional Information: full citation, abstract, references, citings, index terms

While many parallel applications exhibit good spatial locality, other important codes in areas 
like graph problem-solving or CAD do not. Often, these irregular codes contain small records 
accessed via pointers. Consequently, while the former applications benefit from long cache 
lines, the latter prefer short lines. One good solution is to combine short lines with 
prefetching. In this way, each application can exploit the amount of spatial locality that it 
has. However, prefetching, if provided, ... 

9 Dependence based prefetching for linked data structures 
Amir Roth, Andreas Moshovos, Gurindar S. Sohi 

October 1998 Proceedings of the eighth international conference on Architectural
support for programming languages and operating systems, volume 32, 33 issue 5, 11

Full text available: pdf(1.81 MB) Additional Information: full citation, abstract, references, citings, index terms

We introduce a dynamic scheme that captures the access patterns of linked data structures
and can be used to predict future accesses with high accuracy. Our technique exploits the
dependence relationships that exist between loads that produce addresses and loads that
consume these addresses. By identifying producer-consumer pairs, we construct a compact
internal representation for the associated structure and its traversal. To achieve a
prefetching effect, a small prefetch engine speculatively t ...

10 Tolerating memory latency through push prefetching for pointer-intensive applications
Chia-Lin Yang, Alvin R. Lebeck, Hung-Wei Tseng, Chien-Hao Lee

December 2004 ACM Transactions on Architecture and Code Optimization (TACO), volume 1
issue 4

Full text available: pdf(590.24 KB) Additional Information: full citation, abstract, references, index terms

Prefetching is often used to overlap memory latency with computation for array-based 
applications. However, prefetching for pointer-intensive applications remains a challenge 
because of the irregular memory access pattern and pointer-chasing problem. In this paper, 
we proposed a cooperative hardware/software prefetching framework, the push
architecture, which is designed specifically for linked data structures. The push architecture 
exploits program structure for future address generation instea ... 

Keywords: Prefetch, linked data structures, memory hierarchy, pointer-chasing 



11 Limitations of cache prefetching on a bus-based multiprocessor
Dean M. Tullsen, Susan J. Eggers

May 1993 ACM SIGARCH Computer Architecture News, Proceedings of the 20th
annual international symposium on Computer architecture, volume 21 issue 2

Full text available: pdf(1.10 MB) Additional Information: full citation, abstract, references, citings, index terms

Compiler-directed cache prefetching has the potential to hide much of the high memory 
latency seen by current and future high-performance processors. However, prefetching is 
not without costs, particularly on a multiprocessor. Prefetching can negatively affect bus
utilization, overall cache miss rates, memory latencies and data sharing. We simulated the
effects of a particular compiler-directed prefetching algorithm, running on a bus-based
multiprocessor. We showed that, despite a high mem ...

12 Design and evaluation of a compiler algorithm for prefetching 
Todd C. Mowry, Monica S. Lam, Anoop Gupta 

September 1992 ACM SIGPLAN Notices, Proceedings of the fifth international
conference on Architectural support for programming languages and
operating systems, volume 27 issue 9

Full text available: pdf(1.70 MB) Additional Information: full citation, references, citings, index terms




13 Tolerating latency in multiprocessors through compiler-inserted prefetching 
Todd C. Mowry 

February 1998 ACM Transactions on Computer Systems (TOCS), volume 16 issue 1

Full text available: pdf(410.70 KB) Additional Information: full citation, abstract, references, citings, index terms, review

The large latency of memory accesses in large-scale shared-memory multiprocessors is a
key obstacle to achieving high processor utilization. Software-controlled prefetching is a
technique for tolerating memory latency by explicitly executing instructions to move data 
close to the processor before the data are actually needed. To minimize the burden on the 
programmer, compiler support is needed to automatically insert prefetch instructions into 
the code. A key challenge when ... 

Keywords: compiler optimization, prefetching 



14 Clustered microarchitectures: Cluster prefetch: tolerating on-chip wire delays in 
clustered microarchitectures 
Rajeev Balasubramonian 

June 2004 Proceedings of the 18th annual international conference on 
Supercomputing 

Full text available: pdf(153.44 KB) Additional Information: full citation, abstract, references, index terms

The growing dominance of wire delays at future technology points renders a microprocessor 
communication-bound. Clustered microarchitectures allow most dependence chains to 
execute without being affected by long on-chip wire latencies. They also allow faster clock 
speeds and reduce design complexity, thereby emerging as a popular design choice for 
future microprocessors. However, a centralized data cache threatens to be the primary 
bottleneck in highly clustered systems. The paper attempts to id ...

Keywords: clustered microarchitectures, communication-bound processors, data prefetch, 
distributed caches, effective address and memory dependence prediction, processor 




15 Software prefetching for mark-sweep garbage collection: hardware analysis and
software redesign

Chen-Yong Cher, Antony L. Hosking, T. N. Vijaykumar

October 2004 Proceedings of the 11th international conference on Architectural
support for programming languages and operating systems, volume 38, 39, 32
issue 5, 11, 5

Full text available: pdf(165.32 KB) Additional Information: full citation, abstract, references, index terms

Tracing garbage collectors traverse references from live program variables, transitively 
tracing out the closure of live objects. Memory accesses incurred during tracing are
essentially random: a given object may contain references to any other object. Since
application heaps are typically much larger than hardware caches, tracing results in many 
cache misses. Technology trends will make cache misses more important, so tracing is a 
prime target for prefetching. Simulation of Java benchmarks runni ... 

Keywords: breadth-first, buffered prefetch, cache architecture, depth-first, garbage
collection, mark-sweep, prefetch-on-grey, prefetching 



16 The interaction of software prefetching with ILP processors in shared-memory systems 
Parthasarathy Ranganathan, Vijay S. Pai, Hazim Abdel-Shafi, Sarita V. Adve 

May 1997 ACM SIGARCH Computer Architecture News, Proceedings of the 24th
annual international symposium on Computer architecture, volume 25 issue 2

Full text available: pdf(2.44 MB) Additional Information: full citation, abstract, references, citings, index terms

Current microprocessors aggressively exploit instruction-level parallelism (ILP) through 
techniques such as multiple issue, dynamic scheduling, and non-blocking reads. Recent work 
has shown that memory latency remains a significant performance bottleneck for shared- 
memory multiprocessor systems built of such processors. This paper provides the first study
of the effectiveness of software-controlled non-binding prefetching in shared memory 
multiprocessors built of state-of-the-art ILP-based proces ... 

17 Software prefetching
David Callahan, Ken Kennedy, Allan Porterfield

April 1991 Proceedings of the fourth international conference on Architectural support
for programming languages and operating systems, volume 19, 25, 26 issue 2,
Special Issue, 4

Full text available: pdf(1.25 MB) Additional Information: full citation, references, citings, index terms



18 A general framework for prefetch scheduling in linked data structures and its
application to multi-chain prefetching

Seungryul Choi, Nicholas Kohout, Sumit Pamnani, Dongkeun Kim, Donald Yeung
May 2004 ACM Transactions on Computer Systems (TOCS), volume 22 issue 2

Full text available: pdf(2.45 MB) Additional Information: full citation, abstract, references, index terms

Pointer-chasing applications tend to traverse composite data structures consisting of 
multiple independent pointer chains. While the traversal of any single pointer chain leads to 
the serialization of memory operations, the traversal of independent pointer chains provides 
a source of memory parallelism. This article investigates exploiting such interchain memory 
parallelism for the purpose of memory latency tolerance, using a technique called multi- 
chain prefetching. Previous work ... 

Keywords: Data prefetching, memory parallelism, pointer-chasing code 




19 Call graph prefetching for database applications 
Murali Annavaram, Jignesh M. Patel, Edward S. Davidson 

November 2003 ACM Transactions on Computer Systems (TOCS), volume 21 issue 4 
Full text available: pdf(701.71 KB) Additional Information: full citation, abstract, references, index terms

With the continuing technological trend of ever cheaper and larger memory, most data sets 
in database servers will soon be able to reside in main memory. In this configuration, the 
performance bottleneck is likely to be the gap between the processing speed of the CPU and 
the memory access latency. Previous work has shown that database applications have large
instruction and data footprints and hence do not use processor caches effectively. In this
paper, we propose Call Graph Prefetching (CGP), ... 

Keywords: Instruction cache prefetching, call graph, database 



20 Wrong-path instruction prefetching 
Jim Pierce, Trevor Mudge 

December 1996 Proceedings of the 29th annual ACM/IEEE international symposium on 
Microarchitecture 

Full text available: pdf(1.25 MB) Additional Information: full citation, abstract, references, citings, index terms

Instruction cache misses can severely limit the performance of both superscalar processors 
and high speed sequential machines. Instruction prefetch algorithms attempt to reduce the 
performance degradation by bringing lines into the instruction cache before they are needed 
by the CPU fetch unit. There have been several algorithms proposed to do this, most notably 
next line prefetching and target prefetching. We propose a new scheme called wrong-path 
prefetching which combines next-line prefetchin ... 

1 Caching & file systems: The performance impact of kernel prefetching on buffer cache
replacement algorithms 
Ali R. Butt, Chris Gniady, Y. Charlie Hu 

June 2005 Proceedings of the 2005 ACM SIGMETRICS international conference on 
Measurement and modeling of computer systems 

Full text available: pdf(303.47 KB) Additional Information: full citation, abstract, references, index terms

A fundamental challenge in improving the file system performance is to design effective 
block replacement algorithms to minimize buffer cache misses. Despite the well-known 
interactions between prefetching and caching, almost all buffer cache replacement 
algorithms have been proposed and studied comparatively without taking into account file
system prefetching which exists in all modern operating systems. This paper shows that 
such kernel prefetching can have a significant impact on the relative ... 



Keywords: buffer caching, prefetching, replacement algorithms 

2 Architectural and compiler support for effective instruction prefetching: a cooperative
approach

February 2001 ACM Transactions on Computer Systems (TOCS), volume 19 issue 1

Full text available: pdf(432.96 KB) Additional Information: full citation, abstract, references, citings, index terms, review

Instruction cache miss latency is becoming an increasingly important performance 
bottleneck, especially for commercial applications. Although instruction prefetching is an 
attractive technique for tolerating this latency, we find that existing prefetching schemes are 
insufficient for modern superscalar processors, since they fail to issue prefetches early 
enough (particularly for nonsequential accesses). To overcome these limitations, we propose 
a new instruction prefetching technique where ... 

Keywords: compiler optimization, instruction prefetching 

3 The interaction of software prefetching with ILP processors in shared-memory systems
Parthasarathy Ranganathan, Vijay S. Pai, Hazim Abdel-Shafi, Sarita V. Adve

May 1997 ACM SIGARCH Computer Architecture News, Proceedings of the 24th
annual international symposium on Computer architecture, volume 25 issue 2

Current microprocessors aggressively exploit instruction-level parallelism (ILP) through
techniques such as multiple issue, dynamic scheduling, and non-blocking reads. Recent work
has shown that memory latency remains a significant performance bottleneck for shared-
memory multiprocessor systems built of such processors. This paper provides the first study
of the effectiveness of software-controlled non-binding prefetching in shared memory 
multiprocessors built of state-of-the-art ILP-based proces ... 

4 Reducing the impact of software prefetching on register pressure
David W. Shrewsbury, Cindy Norris 

March 2000 Proceedings of the 2000 ACM symposium on Applied computing - Volume 2 

Full text available: pdf(492.51 KB) Additional Information: full citation, references, citings, index terms



5 Performance comparison of thrashing control policies for concurrent Mergesorts with 
parallel prefetching 

Kun-Lung Wu, Philip S. Yu, James Z. Teng 

June 1993 ACM SIGMETRICS Performance Evaluation Review , Proceedings of the 
1993 ACM SIGMETRICS conference on Measurement and modeling of 
computer systems, volume 21 issue 1 

Full text available: pdf(1.23 MB) Additional Information: full citation, abstract, references, citings, index terms

We study the performance of various run-time thrashing control policies for the merge 
phase of concurrent mergesorts using parallel prefetching, where initial sorted runs are 
stored on multiple disks and the final sorted run is written back to another dedicated disk. 
Parallel prefetching via multiple disks can be attractive in reducing the response times for 
concurrent mergesorts. However, severe thrashing may develop due to imbalances between 
input and output rates, thus a large number o ... 

6 Evaluating the impact of memory system performance on software prefetching and 
locality optimizations

Abdel-Hameed A. Badawy, Aneesh Aggarwal, Donald Yeung, Chau-Wen Tseng 

June 2001 Proceedings of the 15th international conference on Supercomputing 

Full text available: pdf(387.06 KB) Additional Information: full citation, abstract, references, citings, index terms

Software prefetching and locality optimizations are techniques for overcoming the speed gap 
between processor and memory. In this paper, we evaluate the impact of memory trends on 
the effectiveness of software prefetching and locality optimizations for three types of 
applications: regular scientific codes, irregular scientific codes, and pointer-chasing codes. 
We find for many applications, software prefetching outperforms locality optimizations when 
there is sufficient memory bandwidth, but ... 

7 Cooperative prefetching: compiler and hardware support for effective instruction 
prefetching in modern processors 

Chi-Keung Luk, Todd C. Mowry 

November 1998 Proceedings of the 31st annual ACM/IEEE international symposium on 
Microarchitecture 

Full text available: pdf(3.41 MB) Additional Information: full citation, references, citings, index terms

8 A general framework for prefetch scheduling in linked data structures and its
application to multi-chain prefetching

Seungryul Choi, Nicholas Kohout, Sumit Pamnani, Dongkeun Kim, Donald Yeung 
May 2004 ACM Transactions on Computer Systems (TOCS), volume 22 issue 2 

Full text available: pdf(2.45 MB) Additional Information: full citation, abstract, references, index terms

Pointer-chasing applications tend to traverse composite data structures consisting of 
multiple independent pointer chains. While the traversal of any single pointer chain leads to 
the serialization of memory operations, the traversal of independent pointer chains provides 
a source of memory parallelism. This article investigates exploiting such interchain memory 
parallelism for the purpose of memory latency tolerance, using a technique called multi- 
chain prefetching. Previous work ... 

Keywords: Data prefetching, memory parallelism, pointer-chasing code 



9 Tolerating latency in multiprocessors through compiler-inserted prefetching
Todd C. Mowry 

February 1998 ACM Transactions on Computer Systems (TOCS), volume 16 issue 1 

Full text available: pdf(410.70 KB) Additional Information: full citation, abstract, references, citings, index terms, review

The large latency of memory accesses in large-scale shared-memory multiprocessors is a 
key obstacle to achieving high processor utilization. Software-controlled prefetching is a 
technique for tolerating memory latency by explicitly executing instructions to move data 
close to the processor before the data are actually needed. To minimize the burden on the 
programmer, compiler support is needed to automatically insert prefetch instructions into 
the code. A key challenge when ... 

Keywords: compiler optimization, prefetching 



10 Speeding up irregular applications in shared-memory multiprocessors: memory binding
and group prefetching

Zheng Zhang, Josep Torrellas

May 1995 ACM SIGARCH Computer Architecture News, Proceedings of the 22nd
annual international symposium on Computer architecture, volume 23 issue 2

Full text available: pdf(1.74 MB) Additional Information: full citation, abstract, references, citings, index terms

While many parallel applications exhibit good spatial locality, other important codes in areas 
like graph problem-solving or CAD do not. Often, these irregular codes contain small records 
accessed via pointers. Consequently, while the former applications benefit from long cache 
lines, the latter prefer short lines. One good solution is to combine short lines with 
prefetching. In this way, each application can exploit the amount of spatial locality that it 
has. However, prefetching, if provided, ... 

11 The performance impact of I/O optimizations and disk improvements 
W. Hsu, A. J. Smith 

March 2004 IBM Journal of Research and Development volume 48 issue 2 
Additional Information: full citation , abstract , references , index terms 

In this paper, we use real server and personal computer workloads to systematically analyze 
the true performance impact of various I/O optimization techniques, including read caching, 
sequential prefetching, opportunistic prefetching, write buffering, request scheduling, 
striping, and short-stroking. We also break down disk technology improvement into four 
basic effects— faster seeks, higher RPM, linear density improvement, and increase in track 
density— and analyze each separately to determi ... 

12 Compiler-based I/O prefetching for out-of-core applications
Angela Demke Brown, Todd C Mowry, Orran Krieger 

May 2001 ACM Transactions on Computer Systems (TOCS), volume 19 issue 2 

Full text available: pdf(499.03 KB) Additional Information: full citation, abstract, references, citings, index terms, review

Current operating systems offer poor performance when a numeric application's working set 
does not fit in main memory. As a result, programmers who wish to solve "out-of-core" 
problems efficiently are typically faced with the onerous task of rewriting an application to 
use explicit I/O operations (e.g., read/write). In this paper, we propose and evaluate a fully 
automatic technique which liberates the programmer from this task, provides high 
performance, and requires only minima ... 

Keywords: compiler optimization, prefetching, virtual memory 

13 Informed prefetching and caching 

R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, J. Zelenka 

December 1995 ACM SIGOPS Operating Systems Review, Proceedings of the fifteenth
ACM symposium on Operating systems principles, volume 29 issue 5

Full text available: pdf(2.13 MB) Additional Information: full citation, references, citings, index terms



14 Prefetching for improved bus wrapper performance in cores
Roman Lysecky, Frank Vahid 

January 2002 ACM Transactions on Design Automation of Electronic Systems 
(TODAES), Volume 7 Issue 1 

Full text available: pdf(430.83 KB) Additional Information: full citation, abstract, references, citings, index terms

Reuse of cores can reduce design time for systems-on-a-chip. Such reuse is dependent on 
being able to easily interface a core to any bus. To enable such interfacing, many propose 
separating a core's interface from its internals by using a bus wrapper. However, this 
separation can lead to a performance penalty when reading a core's internal registers. In 
this paper, we introduce prefetching, which is analogous to caching, as a technique to 
reduce or eliminate this performance penalty, involving a ... 

Keywords: Bus wrapper, PVCI, VSIA, cores, design reuse, intellectual property, interfacing, 
on-chip bus, system-on-a-chip 



15 Predictive Prefetching on the Web and Its Potential Impact in the Wide Area 
Christos Bouras, Agisilaos Konidaris, Dionysios Kostoulas 
June 2004 World Wide Web, Volume 7 issue 2 

Full text available: Publisher Site Additional Information: full citation, abstract, index terms

The rapid increase of World Wide Web users and the development of services with high 
bandwidth requirements have caused the substantial increase of response times for users on 
the Internet. Web latency would be significantly reduced, if browser, proxy or Web server 
software could make predictions about the pages that a user is most likely to request next, 
while the user is viewing the current page, and prefetch their content. 

In this paper we study Predictive Prefetching on a totally ne ... 

16 Configuration prefetch for single context reconfigurable coprocessors
Scott Hauck 

March 1998 Proceedings of the 1998 ACM/SIGDA sixth international symposium on 
Field programmable gate arrays 

Full text available: pdf(1.14 MB) Additional Information: full citation, abstract, references, citings, index terms

Current reconfigurable systems suffer from a significant overhead due to the time it takes to 
reconfigure their hardware. In order to deal with this overhead, and increase the power of 
reconfigurable systems, it is important to develop hardware and software systems to reduce 
or eliminate this delay. In this paper we propose one technique for significantly reducing the 
reconfiguration latency: the prefetching of configurations. By loading a configuration into the 
reconfigurable logic ... 

17 Instruction prefetching of systems codes with layout optimized for reduced cache 
misses 

Chun Xia, Josep Torrellas 

May 1996 ACM SIGARCH Computer Architecture News, Proceedings of the 23rd
annual international symposium on Computer architecture, volume 24 issue 2

Full text available: pdf(1.65 MB) Additional Information: full citation, abstract, references, citings, index terms

High-performing on-chip instruction caches are crucial to keep fast processors busy. 
Unfortunately, while on-chip caches are usually successful at intercepting instruction fetches 
in loop-intensive engineering codes, they are less able to do so in large systems codes. To 
improve the performance of the latter codes, the compiler can be used to lay out the code in 
memory for reduced cache conflicts. Interestingly, such an operation leaves the code in a 
state that can be exploited by a new type of ... 

18 The Performance of Runtime Data Cache Prefetching in a Dynamic Optimization
System

Jiwei Lu, Howard Chen, Rao Fu, Wei-Chung Hsu, Bobbie Othmer, Pen-Chung Yew, Dong-Yuan 
Chen 

December 2003 Proceedings of the 36th annual IEEE/ACM International Symposium on
Microarchitecture

Full text available: pdf(253.79 KB) Additional Information: full citation, abstract, citings, index terms

Traditional software controlled data cache prefetching is often ineffective due to the lack of
runtime cache miss and miss address information. To overcome this limitation, we implement
runtime data cache prefetching in the dynamic optimization system ADORE (ADaptive Object
code RE-optimization). Its performance has been compared with static software prefetching
on the SPEC2000 benchmark suite. Runtime cache prefetching shows better performance. On
an Itanium 2 based Linux workstation, it can increase pe ...

19 Limits on the performance benefits of multithreading and prefetching
Beng-Hong Lim, Ricardo Bianchini 

May 1996 ACM SIGMETRICS Performance Evaluation Review , Proceedings of the 1996 
ACM SIGMETRICS international conference on Measurement and modeling 
of computer systems, volume 24 issue 1

This paper presents new analytical models of the performance benefits of multithreading and
prefetching, and experimental measurements of parallel applications on the MIT Alewife
multiprocessor. For the first time, both techniques are evaluated on a real machine as
opposed to simulations. The models determine the region in the parameter space where the 
techniques are most effective, while the measurements determine the region where the 
applications lie. We find that these regions do not always o ... 

20 Tolerating memory latency through push prefetching for pointer-intensive applications
Chia-Lin Yang, Alvin R. Lebeck, Hung-Wei Tseng, Chien-Hao Lee 

December 2004 ACM Transactions on Architecture and Code Optimization (TACO), volume 
1 Issue 4 

Full text available: pdf(590.24 KB) Additional Information: full citation, abstract, references, index terms

Prefetching is often used to overlap memory latency with computation for array-based 
applications. However, prefetching for pointer-intensive applications remains a challenge 
because of the irregular memory access pattern and pointer-chasing problem. In this paper, 
we proposed a cooperative hardware/software prefetching framework, the push 
architecture, which is designed specifically for linked data structures. The push architecture 
exploits program structure for future address generation instea ... 

Keywords: Prefetch, linked data structures, memory hierarchy, pointer-chasing 